20200203
From meeting with Tom Fitzgerald on 26 November 2019:
• Introgression: - Create giant population VCF - choose the datasets. Just a case of merging some VCFs. - Get some Indonesian medaka • LD decay: - LD plots - per chromosome. - Heatmap per chromosome? • Fst plot
Working directory here: /hps/research1/birney/users/ian/mikk_paper
# move to working directory
homehps
cd mikk_paper
# clone git repository
git clone https://github.com/Ian-Brettell/mikk_genome.git
# create directory for VCFs
mkdir vcfs
cp /nfs/research1/birney/projects/medaka/inbred_panel/medaka-alignments-release-94/vcf/medaka_inbred_panel_ensembl_new_reference_release_94.vcf* vcfs
mikk_genome/data/20200206_cram_id_to_line_id.txt
# Find duplicates
ssh ebi
homehps
cd mikk_paper/mikk_genome/
cat data/20200206_cram_id_to_line_id.txt | cut -f2 | cut -f1 -d"_" | sort | uniq -d
Note the following duplicates:
-106 -11 -117 -131 -132 -134 -135 -138 -14 -140 -141 -15 -23 -32 -39 -4 -40 -49 -59 -69 -7 -71 -72 -80 -84
Only take _1 sibling from pair, unless what is excluded is the only survivor based on mikk_behaviour/data/panel_1/20200109_panel_lines.txt.
Query whether we keep the lines that may have died out? Ask Felix.
mikk_genome/data/20200206_cram2line_key_no-sibs.txt
Excluded IDs: mikk_genome/data/20200206_excluded_lines.txt
20200225
Full list of MIKK lines from Felix here: mikk_genome/data/20200210_panel_lines_full.txt
cat ~/Documents/Repositories/mikk_genome/data/20200210_panel_lines_full.txt | cut -f1 -d"-" | sort | uniq -d
List with no sibling lines here: mikk_genome/data/20200227_panel_lines_no-sibs.txt. 64 lines total.
Excluded IDs here: mikk_genome/data/20200227_panel_lines_excluded.txt. 16 lines total.
Replace all dashes with underscores to match cram2line key file
sed 's/-/_/g' data/20200227_panel_lines_no-sibs.txt > data/20200227_panel_lines_no-sibs_us.txt
Extract the lines to keep from the key file.
awk 'FNR==NR {f1[$0]; next} $2 in f1' data/20200227_panel_lines_no-sibs_us.txt data/20200206_cram_id_to_line_id.txt > data/20200227_cram2line_no-sibs.txt
Has 66 lines instead of 63 (because we’re missing 130-2), so there must be duplicates. Find out which ones:
cat data/20200227_cram2line_no-sibs.txt | cut -f2 | cut -f1 -d"_" | sort | uniq -d
32 71 84
Manually removed (data/20200227_duplicates_excluded.txt):
• 24271_7#5 32_2 • 24271_8#4 71_1 • 24259_1#1 84_2
Final version: data/20200227_cram2line_no-sibs.txt
Final version, cram IDs only:
# create list of CRAM IDs in VCF
bcftools query -l vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf > tmp.txt
# confirm that it's in the same order as the column in the line IDs file
cut -f1 mikk_genome/data/20200206_cram_id_to_line_id.txt | tail -n+2 > tmp2.txt
# bash script to compare
file1="tmp.txt"
file2="tmp2.txt"
if cmp -s "$file1" "$file2"; then
printf 'The file "%s" is the same as "%s"\n' "$file1" "$file2"
else
printf 'The file "%s" is different from "%s"\n' "$file1" "$file2"
fi
# clean up
rm tmp*
# create file with no header
tail -n+2 mikk_genome/data/20200206_cram_id_to_line_id.txt > mikk_genome/data/20200203_cram2line_no-header.txt
# replace tab with space
sed 's/\t/ /g' mikk_genome/data/20200203_cram2line_no-header.txt > tmp.txt
mv tmp.txt mikk_genome/data/20200203_cram2line_no-header.txt
# Rename samples with BCFTOOLS
bcftools reheader --output-file vcfs/panel_line-ids.vcf --samples mikk_genome/data/20200203_cram2line_no-header.txt vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf
# test
bcftools query -l vcfs/panel_line-ids.vcf
#[E::bcf_hdr_add_sample] Duplicated sample name '84_2'
#[E::bcf_hdr_add_sample] Duplicated sample name '141_3'
#[E::bcf_hdr_add_sample] Duplicated sample name '32_2'
#[E::bcf_hdr_add_sample] Duplicated sample name '71_1'
#Failed to open vcfs/panel_line-ids.vcf: could not parse header
# create no-sibs file with CRAM ID only
cut -f1 mikk_genome/data/20200227_cram2line_no-sibs.txt > mikk_genome/data/20200227_cram2line_no-sibs_cram-only.txt
# pull out only samples to be included, then recode
bcftools view --output-file vcfs/panel_no-sibs.vcf --samples-file mikk_genome/data/20200227_cram2line_no-sibs_cram-only.txt vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf
# SUCCESS
# recode
bcftools reheader --output vcfs/panel_no-sibs_line-ids.vcf --samples mikk_genome/data/20200227_cram2line_no-sibs.txt vcfs/panel_no-sibs.vcf
# compress
## option 1: bgzip vcfs/panel_no-sibs_line-ids.vcf
## option 2:
bcftools view --output-type z --output-file vcfs/panel_no-sibs_line-ids.vcf.gz vcfs/panel_no-sibs_line-ids.vcf
# create index
bcftools index --tbi vcfs/panel_no-sibs_line-ids.vcf.gz
mkdir stats
bcftools stats vcfs/panel_no-sibs_line-ids.vcf.gz > stats/20200305_panel_no-sibs.txt
• number of samples: 63 • number of records: 29,161,024 • number of no-ALTs: 0 • number of SNPs: 24,031,673 • number of MNPs: 0 • number of indels: 5575994 • number of others: 449159 • number of multiallelic sites: 2957366 • number of multiallelic SNP sites: 1434908
• ts: 12640470 • tv: 11886484
• ts/tv: 1.06
mkdir refs
cp /nfs/research1/birney/projects/medaka/inbred_panel/medaka-alignments-release-94/ref/Oryzias_latipes.ASM223467v1* refs/
mkdir vcfs/split_by_chr
for i in $(seq 1 24); do
bsub -o log/split_by_chr_$i.out -e log/split_by_chr_$i.err \
"bcftools filter \
--regions $i \
--output-type z \
--output vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
vcfs/panel_no-sibs_line-ids.vcf.gz";
done
mkdir stats/by_chr
for i in $(seq 1 24); do
bsub -o log/stats_by_chr_$i.out -e log/stats_by_chr_$i.err \
"bcftools stats \
vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz > stats/by_chr/$i.txt";
done
mkdir ld
mkdir ld/20200305_panel_maf-0.03_window-50kb
for i in $(seq 1 24); do
bsub -M 30000 -n 8 -o log/ld_$i.out -e log/ld_$i.err \
"vcftools \
--gzvcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--geno-r2 \
--ld-window-bp 50000 \
--maf 0.03 \
--out ld/20200305_panel_maf-0.03_window-50kb/$i";
done
# creates huge files (~10G). Take random 1M lines
## create set up directory
mkdir ld/20200305_panel_maf-0.03_window-50kb_thinned
##
for i in $(seq 1 24); do
bsub -o log/ld_trim_$i.out -e log/ld_trim_$i.err \
"head -1 > ld/20200305_panel_maf-0.03_window-50kb_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200305_panel_maf-0.03_window-50kb/$i.geno.ld >> ld/20200305_panel_maf-0.03_window-50kb_thinned/$i\_1m.txt";
done
head -1 > ld/20200305_panel_maf-0.03_window-50kb_thinned/24_1m.txt
shuf -n 1000000 ld/20200305_panel_maf-0.03_window-50kb/24.geno.ld >> ld/20200305_panel_maf-0.03_window-50kb_thinned/24_1m.txt
#Try with plink and --thin argument
mkdir ld/20200305_panel_maf-0.03_window-50kb_plink
/nfs/software/birney/plink \
--vcf vcfs/split_by_chr/panel_no-sibs_chr-24.vcf.gz \
--double-id \
--thin 0.025 \
--snps-only \
--geno 0.3 \
--r2 \
--ld-window-r2 0 \
--ld-window 999999 \
--ld-window-kb 50000 \
--chr-set 24 \
--out ld/20200305_panel_maf-0.03_window-50kb_plink/test
# works
# TRUE
for i in $(seq 24 24); do
bsub -M 30000 -o log/ld_plink_$i.out -e log/ld_plink_$i.err \
"/nfs/software/birney/plink \
--vcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--double-id \
--thin 0.025 \
--snps-only \
--geno 0.3 \
--r2 \
--ld-window-r2 0 \
--ld-window 999999 \
--ld-window-kb 50000 \
--chr-set 24 \
--out ld/20200305_panel_maf-0.03_window-50kb_plink/$i";
done
–double-id removes the issue of the line IDs having underscores in them. –thin 0.025 takes only 0.025 of the variants –snps-only takes only SNPs –geno 0.3 filters out variants with missing call rates above 0.3 –r2 gets the R^2 –ld-window-r2 0 includes all pairs of variants, including those with R^2 less than 0.2 –ld-window 999999 sets the maximum number of variants allowed between a pair of variants –ld-window-kb 50000 sets the distance of the comparison window –chr-set 24 tells plink that the chromosome number is 24 so that it doesn’t get confused that they’re not human.
for i in $(seq 1 24); do
bsub -M 20000 -n 4 -o log/missing_$i.out -e log/missing_$i.err \
"vcftools \
--gzvcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--missing-site \
--out missing/$i";
done
# Find duplicates
cut -f2 mikk_genome/data/20200206_cram_id_to_line_id.txt | sort | uniq -cd
• 2 141_3 • 2 32_2 • 2 71_1 • 2 84_2
Edited them as follows:
• 2 141_3-2 • 2 32_2-2 • 2 71_1-2 • 2 84_2-2
Saved here: mikk_genome/data/20200305_cram2line_full_dupes-edited.txt
# rehead
bcftools reheader \
--output vcfs/full-run_line-ids.vcf \
--samples mikk_genome/data/20200305_cram2line_full_dupes-edited.txt \
vcfs/medaka_inbred_panel_ensembl_new_reference_release_94.vcf
# compress
bcftools view \
--output-type z \
--output-file vcfs/full-run_line-ids.vcf.gz \
vcfs/full-run_line-ids.vcf
# index
bcftools index \
--tbi vcfs/full-run_line-ids.vcf.gz
20200602
sort(as.integer(names(data_list)))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TEST
test_df <- dplyr::sample_n(data_df, 10000)
test_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (bp)") +
ylab(parse(text = "r^2"))
NA
NA
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200602_ld_decay_subset_allchr", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
# how many?
length(which(data_df$count == 63))
[1] 9657665
ggsave(filename = paste("20200603_ld_decay_subset_allchr_allsmpls", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 20,
units = "cm",
dpi = 500)
View(full_call_df %>% dplyr::filter(distance > 20))
Looking at individual SNPs in the VCF using grep, most of them have low MAFs (3 < x < 5). Run LD calcs again using threshold of 5%.
mkdir ld/20200305_panel_maf-0.05_window-50kb
for i in $(seq 1 24); do
bsub -M 30000 -n 8 -o log/ld_$i.out -e log/ld_$i.err \
"vcftools \
--gzvcf vcfs/split_by_chr/panel_no-sibs_chr-$i.vcf.gz \
--geno-r2 \
--ld-window-bp 50000 \
--maf 0.05 \
--out ld/20200305_panel_maf-0.05_window-50kb/$i";
done
mkdir ld/20200305_panel_maf-0.03_window-50kb_full-calls
for i in $(find ld/20200305_panel_maf-0.03_window-50kb/23.geno.ld); do
name=$(basename $i | cut -f1 -d".");
awk_subscript=$(echo "awk '\$4 == 63' $i >> ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt");
script=$(echo "head -1 $i > ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt; $awk_subscript");
bsub -M 30000 -o log/ld_full-calls_$name.out -e log/ld_full-calls_$name.err $(printf $script);
done
# Can't do it like this - has a problem with the awk script. Have to do it in one job:
for i in $(find ld/20200305_panel_maf-0.03_window-50kb/*); do
name=$(basename $i | cut -f1 -d".");
head -1 $i > ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt;
awk '$4 == 63' $i >> ld/20200305_panel_maf-0.03_window-50kb_full-calls/$name.txt
done
R script here: mikk_genome/code/scripts/20200602_ld_decay_plot.R
for i in $(find ld/20200305_panel_maf-0.03_window-50kb_full-calls/* | head -1); do
date_today=$(date +'%Y%m%d');
name=$(basename $i | cut -f1 -d".");
bsub -M 50000 -o log/$date_today\_$name\_plotld.out -e log/$date_today\_$name\_plotld.err "Rscript --vanilla mikk_genome/code/scripts/20200602_ld_decay_plot.R $i plots/ $date_today";
done
# Can't handle it with 50MB memory
for i in $(find ld/20200305_panel_maf-0.05_window-50kb_full-calls/* | head -1); do
date_today=$(date +'%Y%m%d');
name=$(basename $i | cut -f1 -d".");
bsub -M 50000 -o log/$date_today\_plotld_$name.out -e log/$date_today\_plotld_$name.err "Rscript --vanilla mikk_genome/code/scripts/20200602_ld_decay_plot.R $i plots/ $date_today";
done
# STILL has the line at R^2 == 1!!
20200707
Try all again, this time removing indels.
mkdir ld/20200707_panel_maf-0.05_window-50kb
VCFToolsfor i in $(find vcfs/split_by_chr/*); do
date_today=$(date +'%Y%m%d');
chr=$(basename $i | awk -F'-' '{print $3}' | sed 's/.vcf.gz//');
bsub -M 30000 -o log/$date_today\_ld_$chr.out -e log/$date_today\_ld_$chr.err \
"vcftools \
--gzvcf $i \
--ld-window-bp 50000 \
--maf 0.05 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.05_window-50kb/$chr";
done
# needs --recode flag!!
vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids.vcf.gz \
--max-missing 1 \
--recode \
--stdout > vcfs/panel_no-sibs_line-ids_no-missing.vcf
## compress
bcftools view --output-type z --output-file vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz vcfs/panel_no-sibs_line-ids_no-missing.vcf
# create index
bcftools index --tbi vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz
• number of samples: 63 • number of records: 20,086,433 • number of no-ALTs: 0 • number of SNPs: 16,956,405 • number of MNPs: 0 • number of indels: 3329681 • number of others: 0 • number of multiallelic sites: 1333182 • number of multiallelic SNP sites: 260770
# make directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/
# run
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -M 30000 -o log/$date_today\_ld_maf-max_$chr.out -e log/$date_today\_ld_maf-max_$chr.err \
"vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--ld-window-bp 50000 \
--chr $i \
--maf 0.05 \
--max-maf 0.99 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/$i";
done
## create set up directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/
## Extract random 1M pairs
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -o log/$date_today\_ld_trim_$i.out -e log/$date_today\_ld_trim_$i.err \
"head -1 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/$i.geno.ld > ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99/$i.geno.ld >> ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/$i\_1m.txt";
done
## send to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned/ ~/Documents/Data/20200707_mikk_ld
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.99_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TEST
test_df <- dplyr::sample_n(data_df, 10000)
test_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
NA
NA
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200708_ld_decay_subset_allchr_max_maf_0.99", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
Looks better - with reduced density up the top for some chromosomes - but for others there are still strong lines with an r^2 of 1.
Try setting the max MAF at 0.5.
# make directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/
# run
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -M 30000 -o log/$date_today\_ld_maf-max_$chr.out -e log/$date_today\_ld_maf-max_$chr.err \
"vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--ld-window-bp 50000 \
--chr $i \
--maf 0.05 \
--max-maf 0.50 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/$i";
done
## create set up directory
mkdir ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/
## Extract random 1M pairs
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -o log/$date_today\_ld_trim_$i.out -e log/$date_today\_ld_trim_$i.err \
"head -1 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/$i.geno.ld > ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50/$i.geno.ld >> ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/$i\_1m.txt";
done
## send to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned/ ~/Documents/Data/20200707_mikk_ld
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.05_window-50kb_no-missing_maf-max-0.50_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200709_ld_decay_subset_allchr_max-maf-0.50", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
# make directory
mkdir ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/
# run
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -M 30000 -o log/$date_today\_ld_maf-max_$chr.out -e log/$date_today\_ld_maf-max_$chr.err \
"vcftools \
--gzvcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--ld-window-bp 50000 \
--chr $i \
--maf 0.10 \
--max-maf 0.90 \
--max-alleles 2 \
--min-alleles 2 \
--remove-indels \
--geno-r2 \
--out ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/$i";
done
## create set up directory
mkdir ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/
## Extract random 1M pairs
for i in $(seq 1 24); do
date_today=$(date +'%Y%m%d');
bsub -o log/$date_today\_ld_trim_$i.out -e log/$date_today\_ld_trim_$i.err \
"head -1 ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/$i.geno.ld > ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/$i\_1m.txt && \
shuf -n 1000000 ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/$i.geno.ld >> ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/$i\_1m.txt";
done
## send to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/ ~/Documents/Data/20200707_mikk_ld
### NOTE: weird double header in chr14 file. Remove
head -1 ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt > ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt.tmp
grep -F -v "R^2" ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt >> ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt.tmp && mv ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt.tmp ~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned/14_1m.txt
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90_thinned")
data_files_trunc <- gsub("_1m.txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
# df$snp_1 <- as.integer(df$snp_1)
# df$snp_2 <- as.integer(df$snp_2)
# df$chr <- as.integer(df$chr)
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
data_list <- lapply(data_list, function(chr){
chr$distance_kb <- abs(chr$snp_1 - chr$snp_2) / 1000
# chr$chr <- as.integer(chr$chr)
return(chr)
})
data_df <- dplyr::bind_rows(data_list)
# TEST
test_df <- dplyr::sample_n(data_df, 10000)
test_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
NA
NA
# TRUE
data_df %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.005) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
ggsave(filename = paste("20200715_ld_decay_subset_allchr_min-maf-0.10_max-maf-0.90", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 15,
height = 15,
units = "cm",
dpi = 500)
# No missing
bcftools +fill-tags \
vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--output-type z \
--output vcfs/panel_no-sibs_line-ids_no-missing_with-maf.vcf.gz \
-- \
--tags MAF
## Count of variants: 20,086,433
# No missing, biallelic SNPs only
bcftools view \
--min-alleles 2 \
--max-alleles 2\
--types snps \
--output-type z \
--output vcfs/panel_no-sibs_line-ids_no-missing_with-maf_bi-snps.vcf.gz \
vcfs/panel_no-sibs_line-ids_no-missing_with-maf.vcf.gz
## Count of variants: 16,035,052
# make directory
mkdir maf
# get stats
bcftools query \
--format '%CHROM\t%POS\t%INFO/MAF\n' \
--output maf/20200707_maf.txt \
vcfs/panel_no-sibs_line-ids_no-missing_with-maf_bi-snps.vcf.gz
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/maf/20200707_maf.txt ~/Documents/Data/20200707_mikk_ld
maf_dat <- read.table("~/Documents/Data/20200707_mikk_ld/20200707_maf.txt",
header = F, sep = "\t")
colnames(maf_dat) <- c("chr", "pos", "maf")
maf_dat %>%
ggplot() +
geom_histogram(aes(x = maf,
y=..count../1000000),
bins = 40,
fill = "#2A9D8F",
colour = "#264653") +
theme_bw() +
guides(fill = F) +
xlab("Minor allele frequencies") +
ylab("Count (in millions of sites)")
ggsave(filename = paste("20200707_maf_freqs", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
NOTE Lessons from the start of the PhD:
• Need to first recode SNPs into the 1234 format. • Then recode that into Haploview format. • Replace asterisks with 0 in PED file
# make directory for outputs
mkdir plink
# run over VCF with no missing sites
plink \
--vcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--maf 0.05 \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--allele1234 \
--out plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234
# recode for HV
plink \
--bfile plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234 \
--recode HV \
--maf 0.05 \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234_hv
# replace * with 0
for i in $(find plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234_hv*.ped); do sed -i 's/\*/0/g' $i; done
# send to local to play with
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234_hv.chr* ~/Documents/Data/20200707_mikk_ld/20200707_plink
# Too many comparisons makes Haploview crash. Try thinning.
## First send bfiles to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234.* ~/Documents/Data/20200707_mikk_ld/20200709_plink
# recode for HV (on local)
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234 \
--recode HV \
--maf 0.05 \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--thin 0.05 \
--out ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05
# replace * with 0
for i in $(find /Users/brettell/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
# run Haploview
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05.chr-24.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.05/thinned-0.05.chr-24.info \
-maxDistance 1000
# still crashes. Try with 0.025
mkdir ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200707_panel_no-sibs_line-ids_no-missing_maf-0.05_1234 \
--recode HV \
--maf 0.05 \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--thin 0.025 \
--out ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025
# replace * with 0
for i in $(find /Users/brettell/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
# run Haploview
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-24.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-24.info \
-maxDistance 1000
# Creates plot, but crashes when saving it.
gaston packagemkdir plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05
# make BED
plink \
--vcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--maf 0.05 \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05
# recode for 012
plink \
--bfile plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05 \
--recode A \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012
# recode for 012 transposed
plink \
--bfile plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05 \
--recode A-transpose \
--chr 1-24 \
--allow-extra-chr \
--out plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012.raw ~/Documents/Data/20200707_mikk_ld/20200714_plink
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012.traw ~/Documents/Data/20200707_mikk_ld/20200714_plink
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.bim ~/Documents/Data/20200707_mikk_ld/20200714_plink
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.fam ~/Documents/Data/20200707_mikk_ld/20200714_plink
library(gaston)
library(tidyverse)
mikk <- read.table("~/Documents/Data/20200707_mikk_ld/20200714_plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05_recode-012.traw",
header = T)
# BIM
mikk.bim <- read.table("~/Documents/Data/20200707_mikk_ld/20200714_plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.bim",
header = F,
col.names = c("chr", "id", "dist", "pos", "A1", "A2"))
# FAM
mikk.fam <- read.table("~/Documents/Data/20200707_mikk_ld/20200714_plink/20200714_panel_no-sibs_line-ids_no-missing_maf-0.05.fam",
header = F,
col.names = c("famid", "id", "father", "mother", "sex", "pheno"))
# rename sample IDs
mikk.gen <- mikk
colnames(mikk.gen)[7:ncol(mikk.gen)] <- mikk.fam$id
# remove unneeded columns
mikk.gen <- mikk.gen[, c(1, 7:ncol(mikk.gen))]
# split by chromosome
mikk.gen_lst <- split(mikk.gen, f = mikk.gen$CHR)
# remove CHR column
mikk.gen_lst <- lapply(mikk.gen_lst, function(chr){
chr$CHR <- NULL
return(chr)
})
# split BIM by chr as well
## get counts
mikk.bim %>% group_by(chr) %>% count
## split by chr
mikk.bim_lst <- split(mikk.bim, f = mikk.bim$chr)
set.seed(45)
targ_ind <- sort(sample(nrow(mikk.gen_lst$`8`), 1000))
# pull out 50 SNPs
mikk.gen_8 <- mikk.gen_lst$`8`[targ_ind, ]
mikk.gen_8 <- t(as.matrix(mikk.gen_8))
mikk.bim_8 <- mikk.bim_lst$`8`[targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen_8, bim = mikk.bim_8)
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot( ld.x,
snp.positions = x@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = "~/Documents/Docs/medaka pics/20200602_mikk_genome/20200714_test.pdf")
null device
1
WORKS!
Run on all chrs
# get vector of seeds
set.seed(5)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
lapply(mikk.gen_lst, function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
targ_ind <- sort(sample(nrow(mikk.gen_lst[[counter]]), 1000))
# pull out 50 SNPs
mikk.gen <- mikk.gen_lst[[counter]][targ_ind, ]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]][targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen, bim = mikk.bim)
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = x@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/",
gsub("-", "", Sys.Date()),
"_chr",
counter,
".pdf",
sep = ""))
})
set.seed(5)
seeds <- sample(seq(1, 100), 24)
counter <- 0
mikk_bed_lst <- lapply(mikk.gen_lst, function(chr){
counter <<- counter + 1
# turn chr into list
chr <- list()
# get bed matrix
## set up GEN and BIM files
mikk.gen <- mikk.gen_lst[[counter]]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]]
## form BED matrix
x <- gaston::as.bed.matrix(mikk.gen, bim = mikk.bim)
chr[["bed_mat"]] <- x
# get LD matrix
## get seed
set.seed(seeds[counter])
targ_ind <- sort(sample(nrow(mikk.gen_lst[[counter]]), 1000))
chr[["target_snps_indexes"]] <- targ_ind
## pull out select SNPs
mikk.gen <- mikk.gen_lst[[counter]][targ_ind, ]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]][targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen, bim = mikk.bim)
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
chr[["LD"]] <- ld.x
return(chr)
# # replace NaNs with 0
# ld.x[which(is.na(ld.x))] <- 0
})
unique(all_snps$chr[which(all_snps$N2 == 63)])
[1] 24
all_snps %>%
ggplot() +
geom_histogram(aes(x = maf_manual,
y=..count../1000000),
bins = 40,
fill = "#f4a261",
colour = "#e76f51") +
theme_bw() +
guides(fill = F) +
xlab("Minor allele frequencies") +
ylab("Count (in millions of sites)")
# get data frame of matrix indices with R^2 of 1
test <- data.frame(which(mikk_bed_lst[["1"]][["LD"]] == 1, arr.ind = T))
# remove diagonals
test <- test[which(test$row != test$col), ]
# get indices in full snp list
snp_1_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$row]
snp_2_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$col]
# get distances between those snps
high_ld_df <- data.frame(snp_1_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_1_ind],
snp_2_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_2_ind],
snp_1_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_1_ind],
snp_2_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_2_ind])
high_ld_df$distance_kb <- abs((high_ld_df$snp_2_pos - high_ld_df$snp_1_pos)/1e6)
high_ld_lst <- lapply(mikk_bed_lst, function(x){
# get data frame of matrix indices with R^2 of 1
test <- data.frame(which(x[["LD"]] > 0.9, arr.ind = T))
# remove diagonals
test <- test[which(test$row != test$col), ]
# get count
x[["high_ld_count"]] <- test
# get indices for full snp list
snp_1_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$row]
snp_2_ind <- mikk_bed_lst[["1"]][["target_snps_indexes"]][test$col]
# get distances between those snps
high_ld_df <- data.frame(snp_1_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_1_ind],
snp_2_pos = mikk_bed_lst[["1"]]$bed_mat@snps$pos[snp_2_ind],
snp_1_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_1_ind],
snp_2_maf = mikk_bed_lst[["1"]]$bed_mat@snps$maf[snp_2_ind])
high_ld_df$distance_kb <- abs((high_ld_df$snp_2_pos - high_ld_df$snp_1_pos)/1e6)
x[["high_ld_df"]] <- high_ld_df
return(x)
})
test <- data.frame(high_ld_lst$`17`$LD)
colnames(test) <- high_ld_lst$`17`$bed_mat@snps$pos[high_ld_lst$`17`$target_snps_indexes]
test$snp_1_pos <- high_ld_lst$`17`$bed_mat@snps$pos[high_ld_lst$`17`$target_snps_indexes]
test2 <- pivot_longer(test,
cols = -snp_1_pos,
names_to = "snp_2_pos",
values_to = "r2")
test2$snp_1_pos <- as.integer(test2$snp_1_pos)
test2$snp_2_pos <- as.integer(test2$snp_2_pos)
test2$distance_kb <- abs((test2$snp_1_pos - test2$snp_2_pos)/1000000)
# remove diagonals
test2 <- test2[test2$distance_kb != 0, ]
counter <- 0
ld_df_lst <- lapply(high_ld_lst, function(x){
counter <<- counter + 1
test <- data.frame(x$LD)
colnames(test) <- x$bed_mat@snps$pos[x$target_snps_indexes]
test$snp_1_pos <- x$bed_mat@snps$pos[x$target_snps_indexes]
test2 <- pivot_longer(test,
cols = -snp_1_pos,
names_to = "snp_2_pos",
values_to = "r2")
test2$snp_1_pos <- as.integer(test2$snp_1_pos)
test2$snp_2_pos <- as.integer(test2$snp_2_pos)
test2$distance_kb <- abs((test2$snp_1_pos - test2$snp_2_pos)/1000)
# remove diagonals
test2 <- test2[test2$distance_kb != 0, ]
# get chr
test2$chr <- factor(names(high_ld_lst)[counter], levels = seq(1, 24))
# put in order
test2 <- test2 %>% dplyr::select(chr, snp_1_pos, snp_2_pos, distance_kb, r2)
# assign
x[["r2_df"]] <- test2
return(x)
})
# Extract into data frame
ld_df <- lapply(ld_df_lst, function(x){
x <- x$r2_df
return(x)
})
ld_df <- dplyr::bind_rows(ld_df)
# remove comparisons beyond 50kb
ld_df_50kb <- ld_df[ld_df$distance_kb < 50, ]
# Plot
ld_df_50kb %>%
ggplot(aes(distance_kb, r2)) +
geom_point(size = 0.1, alpha = 0.5) +
geom_smooth(size = 0.5) +
theme_bw() +
facet_wrap(~chr, nrow = 6, ncol = 4) +
xlab("Distance between SNPs (kb)") +
ylab(parse(text = "r^2"))
# get vector of seeds
set.seed(5)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
ld_plot_lst <- lapply(mikk.gen_lst, function(chr){
chr <- list()
counter <<- counter + 1
# create BED MAT from full files
mikk.gen <- mikk.gen_lst[[counter]]
mikk.gen <- t(as.matrix(mikk.gen))
mikk.bim <- mikk.bim_lst[[counter]]
x <- as.bed.matrix(mikk.gen, bim = mikk.bim)
chr[["bed_mat_full"]] <- x
# get indexes to keep
filt_ind <- which(x@snps$maf > 0.10)
# filter from original files and make BED again
mikk.gen_filt <- mikk.gen_lst[[counter]][filt_ind, ]
mikk.gen_filt <- t(as.matrix(mikk.gen_filt))
mikk.bim_filt <- mikk.bim_lst[[counter]][filt_ind, ]
x <- as.bed.matrix(mikk.gen_filt, bim = mikk.bim_filt)
chr[["bed_mat_filt"]] <- x
# get seed
set.seed(seeds[counter])
# made GEN file again
mikk.gen_filt <- mikk.gen_lst[[counter]][filt_ind, ]
# get sample indices
targ_ind <- sort(sample(nrow(mikk.gen_filt), 1000))
# pull out sample SNPs
mikk.gen_samp <- mikk.gen_filt[targ_ind, ]
mikk.gen_samp <- t(as.matrix(mikk.gen_samp))
mikk.bim_samp <- mikk.bim_filt[targ_ind, ]
# create bed matrix
x <- as.bed.matrix(mikk.gen_samp, bim = mikk.bim_samp)
chr[["bed_mat_samp"]] <- x
# compute LD
ld.x <- gaston::LD(x, c(1,ncol(x)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
# LD.plot(ld.x,
# snp.positions = x@snps$pos,
# max.dist = 1000000,
# write.ld = NULL,
# write.snp.id = F,
# pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200715_ld_maf-0.10/",
# "20200715_chr",
# counter,
# ".pdf",
# sep = ""))
return(chr)
})
Script here: code/scripts/20200715_r2_decay_mean.R
mkdir ld/20200715_mean_r2
# TEST
Rscript --vanilla mikk_genome/code/scripts/20200715_r2_decay_mean.R \
ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/18.geno.ld \
ld/20200715_mean_r2
# WORKS
# TRUE
for i in $(find ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/*); do
name=$(basename $i | cut -f1 -d".");
bsub -M 50000 -n 4 -o log/20200715_$name\_mean-r2.out -e log/20200715_$name\_mean-r2.err "Rscript --vanilla mikk_genome/code/scripts/20200715_r2_decay_mean.R $i ld/20200715_mean_r2";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200715_mean_r2/ ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2/")
There were 28 warnings (use warnings() to see them)
r2_df %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200715_mean-r2", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
guides(colour = F)
ggsave(filename = paste("20200715_mean-r2_facet", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
20200723 ### Do again, but shrinking the y-axis labels slighlty to avoid overlap
r2_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
guides(colour = F) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank())
ggsave(filename = paste("20200723_mean-r2_facet_50kb", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 20,
units = "cm",
dpi = 500)
end 20200723
mkdir ld/20200715_mean_r2_1kb-max
# TRUE
for i in $(find ld/20200707_panel_maf-0.10_window-50kb_no-missing_maf-max-0.90/*); do
name=$(basename $i | cut -f1 -d".");
bsub -M 30000 -n 4 -o log/20200715_$name\_mean-r2_1kb-max.out -e log/20200715_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200715_r2_decay_mean_1kb-lim.R $i ld/20200715_mean_r2_1kb-max";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200715_mean_r2_1kb-max/ ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df <- dplyr::bind_rows(data_list, .id = "chr")
r2_df$chr <- factor(r2_df$chr, levels = seq(1, 24))
# get kb measure
r2_df$bin_bdr_kb <- r2_df$bin_bdr / 1000
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb$chr <- factor(r2_df_1kb$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb$bin_bdr_kb <- r2_df_1kb$bin_bdr / 1000
r2_df_1kb %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
sum(unlist(lapply(ld_plot_lst, function(x) nrow(x[["bed_mat_filt"]]@snps))))
[1] 2938051
r2_df_1kb %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F)
ggsave(filename = paste("20200715_mean-r2_1kb-lim_no-guide", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
20200723
Do separate ones for each chr
# create hue palette
chr_pal <- scales::hue_pal()(24)
lapply(seq(1:24), function(x){
# plot
plot <- r2_df_1kb %>%
dplyr::filter(chr == x) %>%
ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
scale_colour_manual(values = chr_pal[x]) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F) +
theme(panel.grid = element_blank())
# save
ggsave(filename = paste(x, ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_decay-by-chr_1kb",
width = 15,
height = 8.25,
units = "cm",
dpi = 500)
})
Insets are 3 x 1.65
end 20200723
library(gaston)
Loading required package: Rcpp
Loading required package: RcppParallel
Attaching package: ‘RcppParallel’
The following object is masked from ‘package:Rcpp’:
LdFlags
Gaston set number of threads to 2. Use setThreadOptions() to modify this.
Attaching package: ‘gaston’
The following object is masked from ‘package:stats’:
sigma
The following objects are masked from ‘package:base’:
cbind, rbind
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ─────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2 ✓ purrr 0.3.4
✓ tibble 3.0.3 ✓ dplyr 1.0.0
✓ tidyr 1.1.0 ✓ stringr 1.4.0
✓ readr 1.3.1 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
20200715 Birney Group meeting
Get human LD plot – contrast. Major one we want to make. And it’s not quite as good as Drosophila and accept that.
Tracking down these LD blocks. Investigate individually. Find explanation for one. Describe the reason why.
For these variants in the blocks, the samples will split be ancestry. As we raise the MAF, we hide the unbalanced scenarios. Take SNPs in blocks. Make a matrix. SNPs by samples, then cluster. Try Hclust. Hopefully we have some on one side of the tree, some on the other. Tree won’t look balanced, but hopefully we have a sample on each side.
The other thing they could be is introgression from Northern medaka line. Came in and refuses to recombine with Southern medaka.
Could be regions of persistent heterozygosity? Sample regions of persistent heterozygosity and make bandage plots for them.
Let Felix know that it’s working out.
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing
# make BED
plink \
--vcf vcfs/panel_no-sibs_line-ids_no-missing.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--chr-set 24 no-xy \
--chr 1-24 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200716
# recode for 012 transposed
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode A-transpose \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200716_recode012
# compress
gzip -k plink/20200716_panel_no-sibs_line-ids_no-missing/20200716_recode012.traw
gzip -k plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.bim
# send to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716_recode012.traw.gz ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.bim.gz ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.bed ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200716.fam ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set
# decompress
gunzip ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716_recode012.traw.gz
gunzip ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716.bim.gz
library(gaston)
library(tidyverse)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
# split by chromosome
mikk_full_split <- lapply(seq(1, 24), function(x){
gaston::select.snps(mikk_full, chr == x)
})
names(mikk_full_split) <- seq(1, 24)
# make list with filtered bed.matrixes based on MAF
## create vector of target MAFs
target_mafs <- c(0.01, 0.03, 0.05, 0.1)
## filter
mikk_maf_filts <- lapply(target_mafs, function(x){
lapply(mikk_full_split, function(y){
gaston::select.snps(y, maf > x)
})
})
names(mikk_maf_filts) <- target_mafs
seq 1 24 > tmp1.txt
grep ">" refs/Oryzias_latipes.ASM223467v1.dna.toplevel.fa | scut -f6 -d":" | head -24 > tmp2.txt
paste tmp1.txt tmp2.txt > mikk_genome/data/Oryzias_latipes.ASM223467v1.dna.toplevel.fa_chr_counts.txt
chr_counts <- readr::read_tsv("~/Documents/Repositories/mikk_genome/data/Oryzias_latipes.ASM223467v1.dna.toplevel.fa_chr_counts.txt",
col_names = c("chr", "length"))
# get vector of seeds
set.seed(5)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
test <- lapply(mikk_maf_filts$`0.03`, function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
# get target indices for subsample
targ_ind <- sort(sample(nrow(chr@snps), 1000))
# convert to logical vector
log_vec <- seq(1, nrow(chr@snps)) %in% targ_ind
# thin BED for sample SNPs
thin_bed <- gaston::LD.thin(chr, threshold = 1, max.dist = 50000, which.snps = log_vec)
# compute LD
ld.x <- gaston::LD(thin_bed, c(1,ncol(thin_bed)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = thin_bed@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200717_ld_maf-0.03/",
"20200717_chr",
counter,
".pdf",
sep = ""))
})
K <- GRM(mikk_full)
heatmap(K)
# get vector of seeds
set.seed(10)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
test <- lapply(mikk_maf_filts$`0.05`, function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
# get target indices for subsample
targ_ind <- sort(sample(nrow(chr@snps), 1000))
# convert to logical vector
log_vec <- seq(1, nrow(chr@snps)) %in% targ_ind
# thin BED for sample SNPs
thin_bed <- gaston::LD.thin(chr, threshold = 1, max.dist = 50000, which.snps = log_vec)
# compute LD
ld.x <- gaston::LD(thin_bed, c(1,ncol(thin_bed)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = thin_bed@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200717_ld_maf-0.05/",
"20200717_chr",
counter,
".pdf",
sep = ""))
})
Calculate LD blocks with Plink
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200717_blocks
for i in $(seq 1 24); do \
bsub -M 30000 -o log/20200717_blocks_$i.out -e log/20200717_blocks_$i.err \
"plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--chr $i \
--blocks no-pheno-req no-small-max-span \
--chr-set 24 no-xy \
--blocks-max-kb 1000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200717_blocks/$i"; \
done
# example for chr 5
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-5.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-5.info \
-maxDistance 1000
# did for chrs 5, 6, 16, 17
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-16.info \
-maxDistance 1000
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-17.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200709_plink/20200709_thinned-0.025/thinned-0.025.chr-17.info \
-maxDistance 1000
Works OK for chrs 5 and 17… • 5: Block 24 (363 Kb). IDs 3526 to 3631 (28418434:28778410) • 17: Blocks 18-36 (4 Mb). IDs 2549 to 3521 (15571645:19618234)
But not for 6 and 16. Try creating a less-thinned PED using the full BED that is sitting on the local
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--recode HV \
--maf 0.05 \
--double-id \
--chr-set 24 no-xy \
--chr 6 \
--thin 0.05 \
--allele1234 \
--out ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721
for i in $(find ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721*.ped); do sed -i 's/\*/0/g' $i; done
sed -i '.bak' 's/\*/0/g' ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721.chr-6.ped
java -Xmx20G -jar ~/Documents/Software/Haploview.jar \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721.chr-6.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.05/20200721.chr-6.info \
-maxDistance 1000
# Thin 0.075
mkdir ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--allele1234 \
--recode HV \
--maf 0.05 \
--double-id \
--chr-set 24 no-xy \
--chr 6,16 \
--thin 0.075 \
--out ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721
for i in $(find ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
java -Xms6G -Xmx6G -jar ~/Documents/Software/Haploview.jar \
-memory 5000 \
-ldvalues RSQ \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-6.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-6.info \
-maxDistance 1000 \
-compressedpng \
-nogui
#Exception in thread "main" java.lang.NoSuchMethodError: java.lang.System.runFinalizersOnExit(Z)V
• 6: From 0.05: Block 112-129 (1.9 Mb). IDs 7693 to 8249 (30372030:32229289) From 0.075: Block 202-224. IDs 11671 to 12491 (30417868:32245570) • 16: From 0.075: N/A. Can’t identify. Try with LD colour scheme
java -Xms6G -Xmx6G -jar ~/Documents/Software/Haploview.jar \
-memory 5000 \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200721_plink_hv/20200721_thinned_0.075/20200721.chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ
# No difference
# recode for HV
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--recode HV \
--double-id \
--chr-set 24 no-xy \
--allele1234 \
--out ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv
## And do it on the cluster too
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV \
--double-id \
--chr-set 24 no-xy \
--allele1234 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview
# remove asterisks on cluster
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview*.ped); do sed -i 's/\*/0/g' $i; done
# remove asterisks on local
for i in $(find ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv*.ped); do sed -i '.bak' 's/\*/0/g' $i; done
rm ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv*.bak
# run on chr 16 directly
java -Xms10G -Xmx10G -jar ~/Documents/Software/Haploview.jar \
-memory 8000 \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.925
# Try with different parameters
java -Xms15G -Xmx15G -jar ~/Documents/Software/Haploview.jar \
-memory 14000 \
-pedfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.ped \
-info ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200721_hv.chr-16.info \
-maxDistance 50 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.99
# try running on cluster
java -Xms20G -Xmx20G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.info \
-maxDistance 100 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.95 \
-nogui \
-compressedpng
# Writing output to 20200722_haploview.chr-16.ped.LD.PNG
# Fatal Error:
# Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
java -Xms45G -Xmx45G -jar /nfs/software/birney/Haploview.jar \
-memory 40000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200722_haploview/20200722_haploview.chr-16.info \
-maxDistance 100 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-spacing 0.95 \
-nogui \
-compressedpng
# Try thinning to different degrees with plink
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1500
# 1000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 1000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/16
# And use this site to convert to PDF! https://www.zamzar.com/ (actually this one is free and works as well: tps://onlineconvertfree.com/)
# Try with 3000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 3000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_3000/16
# try with 10000 and filter for MAF of 0.05 when creating plink HVs
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--maf 0.05 \
--thin-count 10000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-$i;
done
# It filters for MAF AFTER thinning. Create new BED set filtered by MAF
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--make-bed \
--double-id \
--chr-set 24 no-xy \
--maf 0.05 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723
# Leaves 4.36M SNPs
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 10000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
## Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/10000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_10000/16
# Takes a while. Try with 1000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 1000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/16
# Works well. Let's try with 3000
for i in $(seq 1 24); do
plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_maf-0.05/20200723 \
--recode HV-1chr \
--double-id \
--chr-set 24 no-xy \
--chr $i \
--allele1234 \
--thin-count 3000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i;
done
## remove asterisks
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-*.ped); do sed -i 's/\*/0/g' $i; done
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-16.info \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_3000/16
# Looks good. Probably about the limit.
# Try adding BP position to the plots
awk -v OFS="\t" {'print $2,$2'} plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info > plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info.test
# Plot
java -Xms38G -Xmx38G -jar /nfs/software/birney/Haploview.jar \
-memory 38000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-16.info.test \
-maxDistance 1000 \
-ldcolorscheme RSQ \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/16_withID
# WORKS. Do for all in 3000
for i in $(find plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr*.info); do
outname=$(echo $i\_with-id);
awk -v OFS="\t" {'print $2,$2'} $i > $outname;
done
# Plot all
for i in $(seq 1 23); do
bsub -M 20000 -o log/20200723_hv_3000_$i.out -e log/20200723_hv_3000_$i.err \
"java -Xms18G -Xmx18G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/3000/20200723_chr-$i.info_with-id \
-maxDistance 1000 \
-ldcolorscheme DEFAULT \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_3000/$i";
done
# Do for 1000 as well
for i in $(seq 1 24); do
bsub -M 20000 -o log/20200723_hv_1000_$i.out -e log/20200723_hv_1000_$i.err \
"java -Xms18G -Xmx18G -jar /nfs/software/birney/Haploview.jar \
-memory 18000 \
-pedfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i.ped \
-info plink/20200716_panel_no-sibs_line-ids_no-missing/20200723_hv_thinned/1000/20200723_chr-$i.info_with-id \
-maxDistance 1000 \
-ldcolorscheme DEFAULT \
-ldvalues RSQ \
-minMAF 0.05 \
-nogui \
-svg \
-out plots/20200723_ld_thinned_1000/$i";
done
# copy to local
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plots/20200723_ld_thinned_1000/* /Users/brettell/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_haploview/20200723_1000/
svgs
# Use this online tool to convert: https://onlineconvertfree.com/convert-format/svg-to-pdf/
Final PDFs here: ~/Documents/Docs/medaka\ pics/20200602_mikk_genome/20200723_haploview
Based on these plots, the LD blocks to investigate are here:
• 5:28181970-28970558 (788 Kb) • 6:29398579-32246747 (2.85 Mb) • 12:25336174-25384053 (48 Kb) • 14:12490842-12947083 (456 Kb) • 17:15557892-19561518 (4 Mb) • 21:6710074-7880374 (1.17 Mb)
trio.BiocManager::install("trio")
library(trio)
gaston to create the matrixlibrary(gaston)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
# split by chromosome
mikk_full_split <- lapply(seq(1, 24), function(x){
gaston::select.snps(mikk_full, chr == x)
})
names(mikk_full_split) <- seq(1, 24)
# make list with filtered bed.matrixes based on MAF
## create vector of target MAFs
target_mafs <- c(0.03, 0.05)
## filter
mikk_maf_filts <- lapply(target_mafs, function(x){
lapply(mikk_full_split, function(y){
gaston::select.snps(y, maf > x)
})
})
names(mikk_maf_filts) <- target_mafs
trio requires genotype matrix so import
# make PED from BED
plink \
--bfile ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716 \
--recode \
--out ~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716
# PED is 4.13GB. Delete.
test <- trio::read.pedfile("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716.ped",
p2g = T,
non.rs.IDs = T)
## Takes too long.
Try using snpStats
library(snpStats)
mikk_snpstats <- snpStats::read.plink("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716")
#non-unique value when setting 'row.names': ‘.’Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
Back to manual
# read in data
mikk_geno <- readr::read_tsv(file = "~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716_recode012.traw",
progress = T,
col_names = T)
# split
mikk_geno_split <- split(mikk_geno, f = mikk_geno$CHR)
# remove redundant columns, set up col and row names, and convert to matrix
counter <- 0
mikk_geno_split <- lapply(mikk_geno_split, function(x){
counter <<- counter + 1
x <- x[, -(1:6)]
colnames(x) <- mikk_full@ped$id
x <- as.matrix(x)
rownames(x) <- paste(mikk_full_split[[counter]]@snps$chr,
mikk_full_split[[counter]]@snps$pos,
sep = ":")
return(x)
})
# get SNPs with MAF > 0.05
target_snps <- which(mikk_full_split$`6`@snps$maf > 0.05)
# pull out those SNPs
mikk_6 <- mikk_geno_split[["6"]][target_snps, ]
#test <- trio::findLDblocks(mikk_6,
# snp.in.col = F)
# Error in trio::findLDblocks(mikk_geno_split[["6"]], snp.in.col = F) : If x is a matrix, x must contain at most 500 SNPs. If x contains more SNPs, please consider to use getLDlarge.
test <- trio::getLDlarge(mikk_6,
neighbors = 25,
snp.in.col = F,
addVarN = T)
test2 <- trio::findLDblocks(test)
# takes hours to run. Try with fewer neighbours
test <- trio::getLDlarge(mikk_6,
neighbors = 10,
snp.in.col = F,
addVarN = T)
test2 <- trio::findLDblocks(test)
LDExplorerinstall.packages("LDExplorer", repos="http://R-Forge.R-project.org")
library(LDExplorer)
# get vector of seeds
set.seed(10)
seeds <- sample(seq(1, 100), 24)
# run over list
counter <- 0
test <- lapply(mikk_maf_filts$`0.05`[6], function(chr){
counter <<- counter + 1
# get seed
set.seed(seeds[counter])
# set boundaries
bounds <- c(25000000, max(chr@snps$pos))
# get sample of 1000
targ_ind <- sort(sample(which(dplyr::between(chr@snps$pos, bounds[1], bounds[2])), 1000))
# convert into logical vector
log_vec <- seq(1, nrow(chr@snps)) %in% targ_ind
# thin BED for sample SNPs
thin_bed <- gaston::LD.thin(chr,
threshold = 1,
max.dist = 50000,
which.snps = log_vec,
dist.unit = "bases")
# compute LD
ld.x <- gaston::LD(thin_bed, c(1,ncol(thin_bed)))
# replace NaNs with 0
ld.x[which(is.na(ld.x))] <- 0
# plot
LD.plot(ld.x,
snp.positions = thin_bed@snps$pos,
max.dist = 1000000,
write.ld = NULL,
write.snp.id = F,
pdf.file = paste("~/Documents/Docs/medaka pics/20200602_mikk_genome/20200717_ld_maf-0.05/zoomed/",
"20200717_chr6_25Mb-end",
".pdf",
sep = "")
)
})
Not easy to identify. Import Plink blocks
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200717_blocks/*det ~/Documents/Data/20200707_mikk_ld/20200717_plink_blocks
block_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200717_plink_blocks", full.names = T)
file_names <- gsub(".blocks.det", "", basename(block_files))
block_lst <- lapply(block_files, function(x){
readr::read_delim(x,
delim = " ",
col_names = T,
trim_ws = T)
})
names(block_lst) <- file_names
# reorder
block_lst <- block_lst[order(as.integer(names(block_lst)))]
chr6 <- block_lst$`6`
chr6 <- chr6[order(chr6$KB, decreasing = T), ]
# make sequence of 50 kb intervals... too large. Too many hits within each interval. Try 5kb
ints <- seq(1, chr_counts$length[chr_counts$chr == "6"], by = 1000)
#ggplot2::cut_width(c(1, chr6$BP1, chr_counts$length[chr_counts$chr == "6"]), width = 50000)
# create vector of intervals
chr6_bp1 <- ggplot2::cut_interval(c(1, chr6$BP1, chr_counts$length[chr_counts$chr == "6"]),
length = 1000,
labels = F)
# remove first and last entries
chr6_bp1 <- chr6_bp1[2:(length(chr6_bp1) - 1)]
# test for difference in number between ints and unique values of chr6_bp1
length(unique(chr6_bp1))
# use ifelse to create vector of values of either 0 or 100
chr6_yesno <- ifelse(seq(1, length(ints)) %in% chr6_bp1, 100, 0)
# bind both into data frame
chr6_df <- data.frame(ints,
"yesno" = chr6_yesno)
# plot
ggplot(data = chr6_df) +
geom_col(aes(ints, yesno))
# find lengths of consecutive runs
chr6_df$consec <- sequence(rle(chr6_df$yesno)$lengths)
# find highest
max(sequence(rle(chr6_df$yesno)$lengths))
# find which one
which(sequence(rle(chr6_df$yesno)$lengths) == 38)
# that gathers consecutive runs of 0 as well. We want the consecutive runs of 100
chr6_df$consec <- sequence(rle(chr6_df$yesno == 100)$lengths)
# find maximum of 100s
max(chr6_df$consec[chr6_df$yesno == 100])
hist(chr6_df$consec[chr6_df$yesno == 100])
table(chr6_df$consec[chr6_df$yesno == 100])
#
The appearance of clusters changes depending on how zoomed-in you are to the plot.
20200721
Try running Plink --blocks agains with stricter thresholds
Note also that I raised the max blocks KB range to 10MB
mkdir plink/20200716_panel_no-sibs_line-ids_no-missing/20200721_blocks_lowci_0.85
for i in $(seq 1 24); do \
bsub -M 30000 -o log/20200721_blocks_$i.out -e log/20200721_blocks_$i.err \
"plink \
--bfile plink/20200716_panel_no-sibs_line-ids_no-missing/20200716 \
--chr $i \
--blocks no-pheno-req no-small-max-span \
--blocks-strong-lowci 0.8 \
--chr-set 24 no-xy \
--blocks-max-kb 10000 \
--out plink/20200716_panel_no-sibs_line-ids_no-missing/20200721_blocks_lowci_0.85/$i"; \
done
Import Plink blocks
scp brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/plink/20200716_panel_no-sibs_line-ids_no-missing/20200721_blocks_lowci_0.85/*det ~/Documents/Data/20200707_mikk_ld/20200721_plink_blocks
block_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200721_plink_blocks", full.names = T)
file_names <- gsub(".blocks.det", "", basename(block_files))
block_lst <- lapply(block_files, function(x){
readr::read_delim(x,
delim = " ",
col_names = T,
trim_ws = T)
})
names(block_lst) <- file_names
# reorder
block_lst <- block_lst[order(as.integer(names(block_lst)))]
test <- block_lst[["6"]]
test <- test[order(test$KB, decreasing = T), ]
library(gaston)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
# split by chromosome
mikk_full_split <- lapply(seq(1, 24), function(x){
gaston::select.snps(mikk_full, chr == x)
})
names(mikk_full_split) <- seq(1, 24)
# make list with filtered bed.matrixes based on MAF
## create vector of target MAFs
target_mafs <- c(0.03, 0.05)
## filter
mikk_maf_filts <- lapply(target_mafs, function(x){
lapply(mikk_full_split, function(y){
gaston::select.snps(y, maf > x)
})
})
names(mikk_maf_filts) <- target_mafs
data.frame("maf" = mikk_full@snps$maf) %>%
ggplot() +
geom_histogram(aes(x = maf,
y=..count../1000000),
bins = 40,
fill = "#2A9D8F",
colour = "#264653") +
theme_bw() +
guides(fill = F) +
xlab("Minor allele frequencies") +
ylab("Count (in millions of sites)")
ggsave(filename = paste("20200723_maf_freqs", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
cp /hps/research1/birney/users/ian/rac_hyp/vcfs/1gk_all.vcf.gz* /hps/research1/birney/users/ian/mikk_paper/vcfs
# needs --recode flag!!
vcftools \
--gzvcf vcfs/1gk_all.vcf.gz \
--max-missing 1 \
--recode \
--stdout > vcfs/1gk_all_no-missing.vcf
# Made a HUGE file - 298GB and only up to chr 6!
# VCF has 84,739,838 variants!
# Try different tack.
## compress
#bcftools view --output-type z --output-file vcfs/1gk_all_no-missing.vcf.gz vcfs/1gk_all_no-missing.vcf
## create index
#bcftools index --tbi vcfs/1gk_all_no-missing.vcf.gz
mkdir plink/20200723_1gk_no-missing_maf-0.05
# make BED
plink \
--vcf vcfs/1gk_all.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--maf 0.05 \
--geno 0 \
--out plink/20200723_1gk_no-missing_maf-0.05/20200723
# 7110223 SNPs remaining
# make another with no MAF limit
mkdir plink/20200723_1gk_no-missing
plink \
--vcf vcfs/1gk_all.vcf.gz \
--make-bed \
--double-id \
--snps-only \
--biallelic-only \
--geno 0 \
--out plink/20200723_1gk_no-missing/20200723
mkdir ld/20200724_1gk_maf-0.10_window-50kb_no-missing/
for i in $(seq 1 21); do
plink \
--bfile plink/20200723_1gk_no-missing_maf-0.05/20200723 \
--r2 \
--ld-window-kb 50 \
--chr $i \
--maf 0.10 \
--out ld/20200724_1gk_maf-0.10_window-50kb_no-missing/$i;
done
New script here: mikk_genome/code/scripts/20200724_r2_decay_mean_1gk.R
mkdir ld/20200724_mean_r2_1gk
# TEST
Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk.R \
ld/20200724_1gk_maf-0.10_window-50kb_no-missing/22.ld \
ld/20200724_mean_r2_1gk
# WORKS
# TRUE
for i in $(find ld/20200724_1gk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 10000 -o log/20200724_$name\_mean-r2.out -e log/20200724_$name\_mean-r2.err "Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk.R $i ld/20200724_mean_r2_1gk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200724_mean_r2_1gk/ ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1gk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1gk/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df <- dplyr::bind_rows(data_list, .id = "chr")
r2_df$chr <- factor(r2_df$chr, levels = seq(1, 22))
# get kb measure
r2_df$bin_bdr_kb <- r2_df$bin_bdr / 1000
r2_df %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200724_mean-r2_1kg", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df %>% ggplot() +
geom_line(aes(bin_bdr_kb, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (Kb)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
guides(colour = F) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank())
ggsave(filename = paste("20200724_mean-r2_facet_50kb_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 20,
units = "cm",
dpi = 500)
mkdir ld/20200724_mean_r2_1kb-max_1gk
# TRUE
for i in $(find ld/20200724_1gk_maf-0.10_window-50kb_no-missing/*.ld); do
name=$(basename $i | cut -f1 -d".");
bsub -M 10000 -o log/20200724_$name\_mean-r2_1kb-max.out -e log/20200724_$name\_mean-r2_1kb-max.err "Rscript --vanilla mikk_genome/code/scripts/20200724_r2_decay_mean_1gk_1kb-lim.R $i ld/20200724_mean_r2_1kb-max_1gk";
done
# Pull to local
scp -r brettell@ebi:/hps/research1/birney/users/ian/mikk_paper/ld/20200724_mean_r2_1kb-max_1gk ~/Documents/Data/20200707_mikk_ld/
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1kb-max_1gk/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200724_mean_r2_1kb-max_1gk/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_1gk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_1gk$chr <- factor(r2_df_1kb_1gk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_1gk$bin_bdr_kb <- r2_df_1kb_1gk$bin_bdr / 1000
data_files <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/",
full.names = T)
data_files_trunc <- list.files("~/Documents/Data/20200707_mikk_ld/20200715_mean_r2_1kb-max/")
data_files_trunc <- gsub(".txt", "", data_files_trunc)
data_list <- lapply(data_files, function(data_file){
df <- read.delim(data_file,
sep = "\t",
header = T)
#names(df) <- c("chr", "snp_1", "snp_2", "count", "r2")
return(df)
})
names(data_list) <- as.integer(data_files_trunc)
# reorder
data_list <- data_list[order(as.integer(names(data_list)))]
# bind into DF
r2_df_1kb_mikk <- dplyr::bind_rows(data_list, .id = "chr")
r2_df_1kb_mikk$chr <- factor(r2_df_1kb_mikk$chr, levels = seq(1, 24))
# get kb measure
r2_df_1kb_mikk$bin_bdr_kb <- r2_df_1kb_mikk$bin_bdr / 1000
r2_df_1kb_1gk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
labs(colour = "Chromosome")
ggsave(filename = paste("20200724_mean-r2_1kb-lim_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_df_1kb_1gk %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = chr)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
guides(colour = F)
ggsave(filename = paste("20200724_mean-r2_1kb-lim_no-guide_1KG", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
r2_final_lst <- list("1KG" = r2_df_1kb_1gk,
"MIKK" = r2_df_1kb_mikk)
r2_final_df <- dplyr::bind_rows(r2_final_lst, .id = "dataset")
r2_final_df %>% ggplot() +
geom_line(aes(bin_bdr, mean, colour = dataset)) +
theme_bw() +
xlab("Distance beetween SNPs (bp)") +
ylab(bquote(.("Mean r")^2)) +
facet_wrap(~chr, nrow = 6, ncol = 4) +
theme(axis.text = element_text(size = 8),
strip.text = element_text(size = 8),
panel.grid = element_blank()) +
scale_color_manual(values = c("#FC4E07", "#360568")) +
labs(colour = "")
ggsave(filename = paste("20200724_mean-r2_1kb-lim_1KGvMIKK", ".png", sep = ""),
device = "png",
path = "~/Documents/Docs/medaka\ pics/20200602_mikk_genome/",
width = 20,
height = 13,
units = "cm",
dpi = 500)
library(gaston)
# try with read.bed.matrix
mikk_full <- gaston::read.bed.matrix("~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716",
rds = NULL)
## split by chromosome
#mikk_full_split <- lapply(seq(1, 24), function(x){
# gaston::select.snps(mikk_full, chr == x)
#})
#names(mikk_full_split) <- seq(1, 24)
#
## make list with filtered bed.matrixes based on MAF
### create vector of target MAFs
#target_mafs <- c(0.03, 0.05)
### filter
#mikk_maf_filts <- lapply(target_mafs, function(x){
# lapply(mikk_full_split, function(y){
# gaston::select.snps(y, maf > x)
# })
#})
#names(mikk_maf_filts) <- target_mafs
mikk_geno <- readr::read_tsv(file = "~/Documents/Data/20200707_mikk_ld/20200716_plink_full_set/20200716_recode012.traw",
progress = T,
col_names = T)
Parsed with column specification:
cols(
.default = col_double(),
SNP = col_character(),
COUNTED = col_character(),
ALT = col_character()
)
See spec(...) for full column specifications.